SUMMARY


A.  Problem 

To estimate the number of clusters of individuals necessary to account
for the distribution of their test profiles.

B.  Background  and  Requirements 


Classification of enlisted men in A-School training, and prediction of
their performance, require an appropriate statistical description of the
joint distribution of their test scores and performance criteria. The usual
assumption of multivariate normality may not be appropriate. In such cases
prediction may be improved by clustering the men into several groups, each
of which has a normal distribution of scores. The problem addressed by this
research is how many such clusters to use.

C.  Approach 


Several hundred random samples from spherical normal distributions were
generated by a computer pseudo-random number generator. The samples were
fitted to one, two, or three clusters by the NORMIX procedure, and the
likelihood ratios were computed for alternative hypotheses concerning the
number of clusters.


D.  Findings 

The results suggest that the logarithm of the likelihood ratio, when
suitably scaled, is distributed approximately as chi-square with degrees
of freedom equal to twice the number of variables times the difference in
the numbers of hypothesized clusters. This formula has been incorporated
in the significance estimates of the NORMIX 360 computer program.

E. Conclusion

Likelihood ratios for mixture problems are not distributed as chi-square
with degrees of freedom equal to the number of variables; instead, doubling
the degrees of freedom seems to give a better fit to the sampling
distribution.


F.  Recommendations 

The formula given in this paper should be used with caution as a guideline
in estimating the number of clusters in a sample (p. 4).

A MONTE CARLO STUDY OF THE SAMPLING DISTRIBUTION OF THE
LIKELIHOOD  RATIO  FOR  MIXTURES  OF  MULTINORMAL  DISTRIBUTIONS 

I. INTRODUCTION

A previous paper (Wolfe, 1970) presented a maximum-likelihood estimation
procedure for mixtures of distributions. The method fits the data to a
distribution composed of a mixture of a hypothesized number of component
distributions. The obtained likelihood is a measure of the degree of fit.
The (null) hypothesis of r clusters can be compared with the hypothesis of
r' > r clusters by computing the likelihood ratio
$\lambda = L_r / L_{r'}$. This ratio should provide all the information
necessary to test the hypothesis of r clusters against the alternative of
r' clusters, provided we know the sampling distribution of the likelihood
ratio under the null hypothesis.

Wilks (1938) showed under certain regularity conditions that
$-2 \log \lambda$ is asymptotically distributed as chi-square with degrees
of freedom equal to the difference in the number of parameters between the
restricted and unrestricted hypotheses. Hogg (1956) proved, under certain
conditions where the range of the parent distribution is a function of the
parameters, that $-2 \log \lambda$ is distributed exactly as chi-square with
degrees of freedom equal to twice the difference in the number of
parameters. Bartlett (1947) investigated the problem of testing for
equality of r means in multivariate analysis of variance. He improved
Wilks' result for small samples by using

$\chi^2 = -2 C \log \lambda$,

where $C = \frac{1}{N}\left(N - 1 - \frac{m + r'}{2}\right)$,

degrees of freedom $= m(r' - 1)$,

m = number of variables, and N = sample size.

In a previous paper (Wolfe, 1970) Wilks' formula performed poorly in
testing the number of components in the Fisher Iris problem and in the
"Artificial Clusters" problem. In each case Wilks' test rejected the
null hypothesis when it was true.

A little reflection indicates several points where the conditions for
Wilks' theorem are not satisfied. Wilks assumes that the null hypothesis
defines a parameter subspace $\omega_r \subset \Omega_{r'}$ consisting of
points of the form
$(\theta_1, \ldots, \theta_r, \theta_{r+1,0}, \ldots, \theta_{r',0})$, where
$\theta_{r+1,0}, \ldots, \theta_{r',0}$ have fixed values and lie in the
interior of some open set where the likelihood function has a unique
maximum. In the mixture problem, however, the null hypothesis is that the
mixing proportions $\pi_{r+1}, \pi_{r+2}, \ldots, \pi_{r'}$ are equal to
zero, which is at the boundary of the closed set [0,1]. When the mixing
proportions are zero the corresponding means cannot be estimated, since the
likelihood function is completely flat, i.e. unchanged for different values
of those means. The probability density function of r' types involves
r'(m+1) parameters. For each of the r' types there is one parameter for
the mixing proportion and m parameters for the means of that type.
Nevertheless, the comparison of r' against r'-1 types can be accomplished by
imposing only one restriction, that $\pi_{r'} = 0$. Alternatively, m
constraints can be imposed on the means so that two types have the same
means. In this case, it is impossible to estimate the relative proportions
of the two types, since the likelihood function will be flat for
$\pi_{r'-1} + \pi_{r'} =$ constant.
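
The two flatness arguments can be written out explicitly. The notation in this sketch is assumed for illustration and may differ from that of the original report: $\pi_s$ is the mixing proportion of type s, $\mu_s$ its mean vector, $\Sigma$ a covariance matrix shared by all types, and $\phi$ the m-variate normal density.

```latex
% Mixture density with r' types
f(x) = \sum_{s=1}^{r'} \pi_s \, \phi(x; \mu_s, \Sigma)

% Case 1: if \pi_{r'} = 0, the last term vanishes for every value of \mu_{r'},
% so the likelihood does not depend on \mu_{r'}
\pi_{r'} = 0 \;\Longrightarrow\;
f(x) = \sum_{s=1}^{r'-1} \pi_s \, \phi(x; \mu_s, \Sigma)

% Case 2: if \mu_{r'-1} = \mu_{r'} = \mu, only the sum of the two
% proportions enters the likelihood
\pi_{r'-1}\,\phi(x; \mu, \Sigma) + \pi_{r'}\,\phi(x; \mu, \Sigma)
  = (\pi_{r'-1} + \pi_{r'})\,\phi(x; \mu, \Sigma)
```

In either case a whole set of parameter values gives the same likelihood, so the unique interior maximum required by Wilks' theorem does not exist.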

II.  METHOD 

The present paper is concerned with a Monte Carlo investigation of the
sampling distribution of $-2 C \log \lambda$ for mixtures of normal
distributions when the null hypothesis, that the "mixture" contains only
one component, is true.

The pseudo-random normal deviate generator used in this study consisted
of the Lewis, Goodman, and Miller (1969) subroutine for uniform random
variables followed by the IBM (1968) subroutine NDTRI for producing the
inverse of the normal distribution function.
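
The same inverse-transform idea can be sketched in a few lines of Python. This is an illustration only: Python's `random.Random` stands in for the 1969 uniform generator and `statistics.NormalDist().inv_cdf` stands in for NDTRI, so the numbers produced will not match those of the original study. The function name `normal_deviates` is hypothetical.

```python
import random
from statistics import NormalDist


def normal_deviates(n, m, seed=0):
    """Return n rows of m independent N(0, 1) deviates (a spherical normal sample)."""
    rng = random.Random(seed)        # stand-in for the uniform generator
    inv_cdf = NormalDist().inv_cdf   # stand-in for the inverse normal routine
    rows = []
    for _ in range(n):
        row = []
        for _ in range(m):
            u = rng.random()
            while u == 0.0:          # inv_cdf is undefined at 0, so redraw
                u = rng.random()
            row.append(inv_cdf(u))
        rows.append(row)
    return rows


# One bivariate sample of size 100, as in the study design.
sample = normal_deviates(n=100, m=2, seed=1)
```

Each uniform value u in (0, 1) is mapped to the point whose normal distribution function equals u, which is what makes the output normally distributed.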

Using this normal deviate generator, samples from spherical normal
univariate, bivariate, and 22-variate distributions were produced. The
sample sizes were 100, 100, and 113, respectively. One hundred univariate,
one hundred bivariate, and one hundred 22-variate samples were generated.
These samples were run through the 360 NORMIX computer program (Wolfe, 1971)
to obtain likelihoods for hypotheses of one type, two types, and three
types, assuming the types share a common covariance matrix.

On several samples the likelihoods failed to increase when the number
of types increased, apparently because the computer converged on a
suboptimal relative maximum in the likelihood function. When these samples
were re-run with different initial estimates, many of them converged to
solutions with greater likelihoods. The remaining samples, which did not
increase in likelihood after three tries, were omitted from the analysis
except in the calculation of the median likelihood ratios.
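
The retry rule described above can be sketched as follows. The callable `fit` and the function `likelihood_with_retries` are hypothetical names introduced for illustration; they are not part of NORMIX, and the toy fitter simply returns preset values.

```python
def likelihood_with_retries(fit, sample, n_types, baseline, max_tries=3):
    """Refit from new starting values until the log-likelihood beats `baseline`.

    `fit(sample, n_types, start)` is a hypothetical callable returning the
    maximized log-likelihood reached from starting configuration `start`.
    Returns the improved log-likelihood, or None if no try improves on it.
    """
    for start in range(max_tries):
        loglik = fit(sample, n_types, start)
        if loglik > baseline:
            return loglik
    return None


# Toy stand-in for a mixture fitter: the first start stalls at a poor
# relative maximum, the later starts reach a better one.
def toy_fit(sample, n_types, start):
    return [-150.0, -140.5, -140.5][start]


improved = likelihood_with_retries(toy_fit, None, n_types=2, baseline=-142.0)
dropped = likelihood_with_retries(toy_fit, None, n_types=2, baseline=-140.0)
```

Here `baseline` is the log-likelihood of the hypothesis with fewer types. A result of `None` marks a sample that would be set aside as a suboptimal solution.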

III.  RESULTS 

The results of the Monte Carlo study are presented in Table 1. The
function tabulated is the same as Bartlett's formula except that the
number of variables is doubled in computing the coefficient C.

It is evident that if $-2 C \log \lambda$ is to be fitted to a chi-square
distribution, the degrees of freedom will have to be approximately twice
the number of variables, m.

Table 2 gives the percentage frequencies of the corresponding chi-square
probabilities of $-2 C \log \lambda$ with degrees of freedom 2m. The
distribution is approximately uniform, indicating that this chi-square
approximation gives a good fit to the sampling distribution of the
likelihood ratios.
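
A tabulation of this kind can be sketched as below. Two assumptions are made: the "chi-square probability" is taken to be the upper-tail probability (using the lower tail would simply reverse the order of the class intervals), and the closed-form tail sum is used, which is valid only for even degrees of freedom such as 2m. The function names and the four input values are made up for illustration and are not data from the study.

```python
import math


def chi2_upper_tail(x, df):
    """P(chi-square with df degrees of freedom > x), for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):      # sum of (x/2)^k / k! for k < df/2
        term *= half / k
        total += term
    return math.exp(-half) * total


def decile_percentages(stats, df):
    """Percentage of statistics whose tail probability falls in each tenth of [0, 1]."""
    counts = [0] * 10
    for x in stats:
        p = chi2_upper_tail(x, df)
        counts[min(int(p * 10), 9)] += 1   # p == 1.0 goes in the last class
    return [100.0 * c / len(stats) for c in counts]


# Made-up statistics for one variable (df = 2), for illustration only.
percentages = decile_percentages([0.3, 1.0, 2.5, 6.0], df=2)
```

If the chi-square approximation fits, each class interval should hold roughly ten percent of the samples.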

IV.  CONCLUSION 

The data from this Monte Carlo study are more than sufficient to
reject Wilks' test for application to mixture problems. They are not
sufficient to establish the actual sampling distribution of the likelihood
ratio; indeed no empirical method can do this. However, we can conjecture
that $-\frac{2}{N}\left(N - 1 - m - \frac{r'}{2}\right) \log (L_r / L_{r'})$
is distributed asymptotically as chi-square with degrees of freedom
$= 2m(r' - r)$. This conjecture seems to provide the best available
guideline for testing the number of types in a mixture.
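
The conjectured statistic is simple to compute once the two maximized log-likelihoods are available. The following is a minimal sketch; the function name `adjusted_lr_statistic` and the log-likelihood values in the example are hypothetical and chosen only to show the arithmetic.

```python
def adjusted_lr_statistic(loglik_r, loglik_r_prime, n, m, r, r_prime):
    """Return (-2 C log lambda, degrees of freedom) under the conjecture.

    loglik_r and loglik_r_prime are the maximized log-likelihoods for r and
    r' > r types, so log lambda = loglik_r - loglik_r_prime.
    """
    if r_prime <= r:
        raise ValueError("r_prime must exceed r")
    c = (n - 1 - m - r_prime / 2.0) / n          # coefficient C
    statistic = -2.0 * c * (loglik_r - loglik_r_prime)
    df = 2 * m * (r_prime - r)                   # twice m per added type
    return statistic, df


# Hypothetical log-likelihoods for a bivariate sample of size 100,
# testing one type against two.
statistic, df = adjusted_lr_statistic(-290.0, -287.5, n=100, m=2, r=1, r_prime=2)
```

With these inputs C = (100 - 1 - 2 - 1)/100 = 0.96, so the statistic is 2 x 0.96 x 2.5 = 4.8 on 4 degrees of freedom.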

TABLE 1

Adjusted Likelihood Ratios for Random Normal Data

Number of Variables                     1         2        22
Sample Size                           100       100       113
Number of Samples                     100       100        25
Number of Samples Retained
  for Analysis*                        81        97        25

-2 C log L1/L2 **
  Median                             1.22      3.57     44
  Mean                               2.37      3.99     43.02
  Standard Deviation                 2.54      2.44      6.79
  Minimum                            0         0.44     31.23
  Maximum                            9.78     12.51     58.42

-2 C log L2/L3
  Median                             1.21      4.03     48
  Mean                               2.23      4.53     45.17
  Standard Deviation                 2.26      3.33     10.14
  Minimum                            0         0.04     19.20
  Maximum                            9.88     17.06     60.89

*  Only those cases where $L_{r+2} > L_{r+1} > L_r$ were retained for
   analysis, the others being considered suboptimal solutions.

** $C = \frac{1}{N}\left(N - 1 - \frac{2m + r'}{2}\right)$, where
   m  = number of variables,
   r' = number of types in the unrestricted hypothesis,
   N  = sample size.


TABLE 2

Percentage Frequencies of Chi-Square Probabilities for Random Normal Ratio

                        P(L2/L1)                        P(L3/L2)
Class            1         2        22           1         2        22
Interval     Variable Variables Variables    Variable Variables Variables

.00-.10         16         6         4          15        18        12
.10-.20          8        14         8          15        11        16
.20-.30          4        16         8           7         9        12
.30-.40         12        10        16          10        12        16
.40-.50         10         8         8           4         4         8
.50-.60          9         9         8          10        11         4
.60-.70         10         9        16           4         3         4
.70-.80         15         9        12           6        13        12
.80-.90          9        13        16          16         4         8
.90-1.00         7         6         4          13        15         8